Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)


downloaded and used for research purposes or for learning. Most RNA-Seq raw sequence data are in FASTQ file format. When analyzing RNA-Seq data, we must pay attention to the design of the study. For instance, if the purpose is differential gene expression analysis, there must be control raw data to compare against. The choice of control is determined by the research goal. In conditions like cancer, researchers may use sequencing data from healthy tissue as the control for data from the affected tissue, with both samples taken from the same individual. However, researchers may also intend to compare gene expression across individuals or samples. Most researchers include replicate samples in the design of their study, so there will be multiple raw data files for a single condition. Replicates reduce the impact of errors introduced by the laboratory technique and of errors generated during the sequencing steps.
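
The design requirements above (a control condition to compare against, and replicates for every condition) can be checked before any analysis starts. The following is only a minimal sketch: the sample names, the "normal" control label, and the minimum of two replicates are assumptions made for illustration, not values prescribed by the text.

```python
from collections import Counter

# Hypothetical sample sheet: sample name -> condition.
# The names and labels are illustrative only.
SAMPLES = {
    "tumor_rep1": "tumor",
    "tumor_rep2": "tumor",
    "tumor_rep3": "tumor",
    "normal_rep1": "normal",
    "normal_rep2": "normal",
    "normal_rep3": "normal",
}


def check_design(samples, control="normal", min_replicates=2):
    """Return a list of problems found in a differential-expression design."""
    counts = Counter(samples.values())
    problems = []
    # A differential expression study needs something to compare against.
    if control not in counts:
        problems.append(f"no samples for control condition '{control}'")
    if len(counts) < 2:
        problems.append("need at least two conditions to compare")
    # Every condition should be represented by several replicates.
    for condition, n in sorted(counts.items()):
        if n < min_replicates:
            problems.append(f"condition '{condition}' has only {n} replicate(s)")
    return problems


if __name__ == "__main__":
    for message in check_design(SAMPLES) or ["design looks consistent"]:
        print(message)
```

An empty list means that, under these assumptions, the sheet has a control condition and enough replicates per condition; it says nothing about whether the biological comparison itself is appropriate.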

For practice, we will use RNA-Seq raw data from a breast cancer study of differential gene expression in tumor cells. The data are in six FASTQ files (three replicates for tumor and three replicates for normal) containing paired-end reads of 151 bases. For the sake of simplicity, the files include only the RNA-Seq reads of chromosome 22. The data were reduced to be as simple as possible so that processing does not take too much time. To keep the files organized, create a main directory “rnaseq” to serve as the project directory, create the subdirectory “fastq” inside it, and then download the raw data from “https://github.com/hamiddi/ngs” into this subdirectory. To avoid repetition, assume that the raw data files have already been cleaned of adaptors, duplicates, and low-quality reads.
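
The directory layout and a quick sanity check of the downloaded files might be sketched as follows. This is an illustration under stated assumptions, not the book's procedure: the helper names are hypothetical, the parser assumes plain four-line FASTQ records (no wrapped sequence lines), and because the file names in the repository are not given here, the download itself is left to the reader.

```python
import gzip
from pathlib import Path


def make_project_dirs(base="."):
    """Create rnaseq/fastq under `base` and return the fastq directory."""
    fastq_dir = Path(base) / "rnaseq" / "fastq"
    fastq_dir.mkdir(parents=True, exist_ok=True)
    return fastq_dir


def read_fastq(path):
    """Yield (name, sequence, quality) for each four-line FASTQ record."""
    path = Path(path)
    # Transparently handle gzip-compressed files.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip()
            if not header:
                break  # end of file
            seq = handle.readline().rstrip()
            handle.readline()  # '+' separator line, ignored
            qual = handle.readline().rstrip()
            if not header.startswith("@") or len(seq) != len(qual):
                raise ValueError(f"malformed FASTQ record near {header!r}")
            yield header[1:], seq, qual


def summarize_fastq(path):
    """Return (number of reads, set of observed read lengths) for one file."""
    n_reads, lengths = 0, set()
    for _name, seq, _qual in read_fastq(path):
        n_reads += 1
        lengths.add(len(seq))
    return n_reads, lengths


# Example use (run from the directory that should hold the project):
#   fastq_dir = make_project_dirs()
#   for fq in sorted(fastq_dir.glob("*.fastq*")):
#       print(fq.name, summarize_fastq(fq))
```

For the practice data described above, the set of read lengths would be expected to contain 151; trimmed files may show shorter lengths as well.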

5.3.2 Read Mapping

Read mapping follows preprocessing and cleaning of the raw data. The accuracy of the analyses depends heavily on the read mapping. The mapping, as discussed in Chapter 2, is the

FIGURE 5.1 RNA-seq data analysis workflow.